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Abstract When deahng with classical spike train analysis, the practitioner often per- 
forms goodness-of-fit tests to test whether the observed process is a Poisson process, 
for instance, or if it obeys another type of probabilistic model (Yana et al. in Bio- 
phys. J. 46(3):323-330, 1984; Brown et al. in Neural Comput. 14(2):325-346, 2002; 
Pouzat and Chaffiol in Technical report, arXiv;0909.2785, 2009). In doing so, there is 
a fundamental plug-in step, where the parameters of the supposed underlying model 
are estimated. The aim of this article is to show that plug-in has sometimes very un- 
desirable effects. We propose a new method based on subsampling to deal with those 
plug-in issues in the case of the Kolmogorov-Smirnov test of uniformity. The method 
relies on the plug-in of good estimates of the underlying model that have to be consis- 
tent with a controlled rate of convergence. Some nonparametric estimates satisfying 
those constraints in the Poisson or in the Hawkes framework are highlighted. More- 
over, they share adaptive properties that are useful from a practical point of view. We 
show the performance of those methods on simulated data. We also provide a com- 
plete analysis with these tools on single unit activity recorded on a monkey during a 
sensory-motor task. 
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1 Introduction 

In neuroscience, the action potentials (spikes) are the main components for the real- 
time information processing in the brain. Moreover, it is possible to record in vivo 
several neurons and to have access to simultaneous spike trains. The duration of each 
spike is very small, about one millisecond. Moreover, the number and the position 
of each spike fluctuate from one trial to another trial. It is consequently quite natural 
to assimilate a spike to a random event. Therefore, in this article, we mathematically 
model spike trains as real-valued point processes that have been deeply described 
and studied for a long time in the literature (see [4] for a review) and often used in 
neuroscience (see, for instance, [2] and the references therein). However, except in 
very particular tests of independence (see, for instance, [5, 6]), it is most of the time 
necessary to describe spike trains as realizations of particular stochastic processes. 

Most of the analyses start by answering a standard basic question. Is the process 
an homogeneous Poisson process or not? See, for instance, [7-9]. Indeed, for this 
simple model, extensively used in neuroscience, there is only one parameter to infer, 
namely the firing rate. The study of firing rates in neuroscience has lead to significa- 
tive advances in the understanding of the coding of the direction of movements [10] 
for instance. But most of the time, spikes trains are more complex than homogeneous 
Poisson processes. Various studies have exhibited different kinds of correlations be- 
tween some motor, sensory, or cognitive events in a behaving animal and a variation 
of the firing rate of specific neurons, before, during or after this event [11, 12]. In 
particular, such data cannot be stationary. So, constraints on the previous model are 
relaxed and processes can be assumed to be inhomogeneous Poisson processes. In 
this setting, the firing rate is now time-dependent and is modeled by a function A(-), 
which is the intensity of the inhomogeneous Poisson process (see [8, 9]). Several 
studies have also established statistical evidence of dependence between the occur- 
rences of the spikes of several neurons (see [5, 6, 13-15]) or even within a given 
neuron. In this case, standard homogeneous or inhomogeneous Poisson processes 
cannot be used and models based on univariate or multivariate Hawkes processes 
or variations of them are quite natural to capture dependence of spikes occurrences 
[16-21]. Hawkes processes, extensively described and discussed later on, generalize 
homogeneous Poisson processes by using functions quantifying interactions between 
spikes. These functions are called interaction functions. Such interaction functions 
are used in neuroscience to model excitation and inhibition phenomena [22]. 

Whatever the chosen model, this model has to be tested before any other inference 
based on this model. A plug-in step to infer unknown parameters is most of the time 
unavoidable to perform these tests. More precisely, for general models on point pro- 
cesses, the main ingredient consists in transforming the data so that the time changed 
process becomes a homogeneous Poisson process, fact which can be easily tested. 
However, the parameters of the transformation are usually unknown and are replaced 
by estimates. This plug-in trick has been widely popularized since [23]. It is widely 
used in neuroscience since [1] (see also the textbook of Tuckwell [24], [3], or [2]). 
The main goal of this article is to precisely show that the plug-in step may sometimes 
lead to undesirable effects and to propose the subsampling as a reasonable and quite 
universal solution. We focus here on the Kolmogorov-Smirnov (K.S.) test of unifor- 
mity. Indeed this K.S. test is usually considered as one of the three main tests on 
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the first-order statistics that can be done to test the homogeneous Poisson hypothesis 
(see [1] and the references therein). More generally, the K.S. test (see [25] for its first 
use up to our knowledge) is one of the main omnibus tests [26], meaning that it is 
effective against a wide class of alternatives. However, it is known that a plug-in has 
to be taken with care for this test (see [27] for some brief discussion of this point). 
By using aggregated or cumulated tests, we propose 5 tests based on subsampling as 
goodness-of-fit tests, for which plug-in issues are solved. Note that, in neuroscience, 
plug-in problems have already been emphasized for other types of tests, namely the 
independence tests [22] . 

The second goal of this paper results from the first one: We have to develop statisti- 
cal methods in the setting of point processes to estimate functions such as the intensity 
for the Poisson model or the interaction functions for the Hawkes model. Standard 
statistical procedures consist in assuming that these functions are parameterized by 
a few number of parameters, and in taking (for instance) the maximum likelihood 
estimator [28, 29]. This approach is called parametric. For instance, assuming that 
a spike train is an homogeneous Poisson process, is equivalent to parameterizing the 
intensity by one parameter, namely the fixed constant firing rate. However, in neuro- 
science, except in the particular case of the homogeneous Poisson process, there is no 
a priori parametric shape for the functions to be estimated. These functions are most 
of the time unknown. Our second main contribution consists in proposing estimation 
procedures in a very flexible setting once the probabilistic model is fixed. So we con- 
sider the setting of nonparametric statistics, which is designed to estimate functions 
when no parametric model can be assumed. In particular, this nonparametric setting 
allows us to weaken assumptions considerably. The estimates proposed in this paper 
are based on kernel rules, wavelets expansions, or penalized criteria. Not only are 
they nonparametric, but they also share the following features: 

1. They are obtained by completely data-driven procedures that can be used even by 
neophytes in nonparametric statistics. 

2. They achieve optimal convergence rates. 

3. They do not assume light tails or any shape (exponential, unimodal, etc.) about 
the underlying function. 

4. They adapt to the smoothness of the underlying function. 

Furthermore, the developed strategies considerably extend the procedures proposed 
by [7, 30]. In particular, new data-driven kernel rules are introduced to estimate the in- 
tensity of inhomogeneous Poisson processes. We also derive a lasso-type estimate for 
recovering interaction functions of multivariate Hawkes processes when observing n 
trials. Some new interpretations of the estimate and connections with classical tools 
of the neuroscience literature such as joint peristimulus time histograms (JPSTH) and 
cross correlograms are also proposed. Theoretical results are established by using the 
oracle approach (see later). 

The article is organized as follows. We first explain how subsampling can over- 
come the issues raised by plug-in for goodness-of-fit tests for the special case of 
the K.S. test. Then we extensively discuss adaptive nonparametric estimation and its 
advantages with respect to parametric estimation. This is illustrated on Poisson or 
Hawkes processes and a wide range of nonparametric methods are proposed. Finally, 
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some simulations have been performed and real data sets coming from the recordings 
of a sensory-motor task (that can be found in [15], for instance) are analyzed thanks 
to these new methods. Most of the analysis has been performed with the software R. 
We refer to [7] for a complete list of its advantages. 

Let us introduce succinctly the main notions. More mathematical insight on the 
subject can also be found in [3 1]. For more-to-the-point definitions in link with neuro- 
science, and heuristic interpretations, we refer the interested reader to the very limpid 
article of Brown et al. [2] on the time-rescaling theorem. In the sequel, a point pro- 
cess A' is a random countable set of points. For all measurable subset A, N{A) is the 
random variable giving the number of points of in A. The associated point measure 
dN is defined as follows: for all measurable function /, 

f f(x)dN(x) = J2 /(^)- 

TeN 

To a finite point process on the positive real line, one can associate the correspond- 
ing counting process {Nt)t>o = (N([0, t]))r>o and its compensator (A(0)f>o with 
respect to some given filtration (history). Most of the time, a conditional intensity 
k{-) depending on the past history exists and in this case 

A(t) = / X(u)du. 
Jo 

The function A( ) is therefore continuous nondecreasing. This is also the time- 
transformation on which the time-rescaling theorem is based [2]. In the sequel, 

V 

Xp > 0 means that the sequence Xp converges in probability toward 0 when 

C 

p tends to infinity; Xp > X means that the distribution of Z„ tends to the one of 

X when p tends to infinity. 



2 Goodness-of-Fit Tests: The Plug-in Drawback and Subsampling as a Possible 
Universal Solution 

Once spike trains have been obtained and sorted, neurophysiologists often perform a 
very basic data analysis, which consists in testing several features such as stationarity 
for instance among other statistical inferences [7]. Following Ventura et al. [8], the 
first step of a "good practice" is usually to test whether the observed spike train is 
homogeneous Poisson or not. But it is usually admitted that real spike trains cannot 
be that simple and this hypothesis is most of the time rejected. To explain the re- 
jection, the next step, still following [8], is to impute it to a lack of stationarity or 
to something more complex. It means that we have to test whether the process is an 
inhomogeneous Poisson process or not. For this purpose, one uses the time-rescaling 
theorem (see [32] but also [4, 31]) under the hypothesis that the process is a Poisson 
process with deterministic intensity X{-). Its associated compensator A{-) is in this 
case deterministic as well. The time-rescaling theorem, in its simplest version, states 
therefore that if is a Poisson process with intensity !(■), observed on [0, T^ax], then 
J\f — {X = A(T) : r e A^} is an homogeneous Poisson process on [0, A(T'max)] with 
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intensity 1, fact which can be tested by practitioners. However, there is a misspecifica- 
tion in the method since k(-) is unknown. The most popular and widely used method 
in neuroscience consists in plugging an estimate X(-) in [8]. As explained in the Intro- 
duction, we first illustrate the drawbacks of noncautious plug-ins for goodness-of-fit 
tests on the K.S. test, which has already been observed by [27]. We then propose a 
remedy to overcome these drawbacks based on subsampling. 

2. 1 Elementary Situation for Illustration 

Let us illustrate our purpose on a very basic situation. Assume that one observes 
Xi, . . . , X,, n independent and identically distributed (i.i.d.) real variables with cu- 
mulative distribution function (c.d.f.) u — >• F{u) = P(Xi < u). Given Fq a c.d.f., we 
can test whether the hypothesis Hq: "F — Fq" is true or not. To do so, let us first 
define F„ the empirical distribution function associated with the X, 's by 

1 " 

u Fn{u) = - ^ 1(A',<»)- 
" (=1 

If n is large enough, F„(u) is close to F{u) for any u. The K.S. test is therefore based 
on the statistic 

KSn = sup|p;,(m) - Fo{u)\. (1) 

LI 

Under Hq, if Fq is continuous, the distribution of KS,, is known and it does not depend 
on Fq, so it can be tabulated [27]. For any a e (0, 1), let k„j-a be the 1 — a quantile 
of this distribution. The classical (without plug-in) K.S. test consists in rejecting Hq 
whenever KSn > ^«,i-ff and this test is of exact level a. Note also that when n tends 
to oo, the random variable ^KS,, tends in distribution to a tabulated distribution JC 
(see [33]). As a consequence, if ki-a is the 1 — a quantile of /C, ^k„ \-a tends 
to ki-a and the approximation is valid as soon as n > 45 [34] (see also Durbin's 
modification in [27]). 

Often, the c.d.f. Fq is unknown since it depends on one or several unknown pa- 
rameters and a natural idea consists in estimating it to use the previous procedure. 
This idea, extensively used in neuroscience, can lead to false results. For illustration, 
assume for example that we wish to test the hypothesis Hq "the X, 's are exponential 
with unknown parameter X." Note that this hypothesis is often tested on the inter- 
spike time intervals (ISI) [24] in order to test whether the observed spike process is 
an homogeneous Poisson process with unknown intensity X. Following the scheme 
described previously, a natural procedure to test the exponentiality of the X, 's could 
be the following: 

(i) Estimate X by X = l/X, where X is the empirical mean of the Z/'s: X = 

(ii) Plug in the estimate X and estimate Fohy u F(u) = 1 — exp(— 1m). 

(iii) Form the K.S. statistic (1) by replacing Fq by F . This leads to KS^^K 

(iv) Reject Hq whenever /TS''^'' > 

The p-values of this test are represented in Fig. 1. If the distribution of the test 
statistic was correctly predicted by the quantiles kn,i-a, then the repartition of the 
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rank 

Fig. 1 Repartition of tlie p-values in a K.S. test of exponentiality witli plug-in. Graph of the p-values 
as a function of their rank. A n = 40 i.i.d. sample of exponential variables with parameter J. = 20 has 
been drawn 1000 times. Each time a p-value has been computed either by estimating the parameter k and 
performing the K.S. test with exactly the same sample (iVi black), or by estimating the parameter X on half 
of the sample and performing the K.S. test on the other half (in red), or by estimating the parameter k 
on the whole sample and performing the K.S. test on a subsample of size n^^^ (in green). Note that the 
estimated level (i.e., the number of p-values smaller than 0.05 divided by 1000) is in the first case of 0.009, 
of 0.12 in the second case, and of 0.039 in the third case. Those levels and curves are stable with respect 
to the sample size: Similar results are obtained for a larger sample size (n = 200 and n = 1000) 

p-values should be close to the first diagonal of the graph (see [35]). Clearly, the 
curve is above the diagonal and the test is too conservative, which means that the test 
will accept Hq more than required. The previous procedure fails in obtaining good 
results since, roughly speaking, the same data are used to estimate X and to compute 
the test statistic. For very specific c.d.f., this problem can be overcome by computing 
the distribution of KS^^^ (see [27] for exponential and Gaussian cases). However, this 
is based on a trick that makes distributions, in those specific cases, independent of 
the unknown underlying parameter X. Therefore, this solution cannot be adapted to 
complex situations such as the inhomogeneous Poisson process framework described 
above in the neuroscience field [8]. 

To be more careful and to avoid dependencies between X and F„ , we could use the 
following "split into two parts" procedure where n is assumed to be even. 

(i) Estimate 1 by X = l/X, where X is the empirical mean of the first half of the 
Xrs:^^2/nj:';i]Xi. 

(ii) Plug in the estimate X and estimate Fohy u F{u) = 1 — exp(— 1m). 

(iii) Form the K.S. statistic (1) by replacing Foby F, but also by replacing F„ by the 
empirical cumulative distribution function only based on X„/2+i , . . . , X,,- This 
leads to KS'^^\ 

(iv) Reject Hq whenever KS^^^ > k„/2,i-a- 

The /7-values of this test are represented on Fig. 1 . Surprisingly, the distribution 
of the p-values shows that the resulting test is not conservative enough. Indeed, the 
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test will reject Hq more than required and this procedure is even worse than the first 
strategy. Therefore, we turn toward a much more universal strategy, subsampling, 
thanks to the following result (see the Additional File 1 for the proof). 

Proposition 1 Let X\, . . . , Xp be p i.i.d. variables with c.d.f. F assumed to be con- 
tinuous. Let Fp be the associated empirical distribution. Assume that F is a consis- 
tent estimate of F such that 

Vpsup|F(jc) - f (x)| — ^ 0. (2) 



Then 



JpsuplFpix) - F(x)\ > K. 



Therefore, it remains to find F satisfying (2). In most of the parametric 
cases, and in particular in the exponential case. Assumption (2) does not hold 
if F is based on the same data as Fp. Assumption (2) may hold if p is much 
smaller than n, the whole sample size, as illustrated by the following strategy. 

/^est 1 

1 . Estimate A by A, = l/X, where X is the empirical mean of the Xj 's on the whole sample 
size n. 

2. Select randomly a subsample 5 of the trials with cardinality p — pin), such that 

p(n)/n — > 0 (for instance take p(n) = Jn or pin) = n^l^). 
n^oo 

3. Compute on 5 the following empirical cumulative distribution function: 

Vx>0, Fs(x)=-J2HXi<x]- 
^ ieS 

4. Take ki_a the \ — a quantile of the asymptotic distribution K.. 

5. Reject Hq: "the distribution of the X, 's is exponential" whenever 

sup \Fsix) - F{x)\ > ki_a, 

xe»+ 



where for any x > 0, 



F(x) = l-e~^\ 



Technical arguments of Additional File 1 prove that the previous test is of exact 
level a asymptotically. More importantly, in practice this conclusion remains true 
even for relatively small values of n as shown in Fig. 1 illustrated with n = 40. Even 
if this test is not as powerful as the one described in [27], it has the main advantage 
to be almost universal. It can be adapted to most of parametric situations, since the 
use of subsampling makes the condition (2) quite easy to fulfill. 

We want now to adapt this method to the more general scheme of goodness-of-fit 
tests for counting processes. From now on and whatever the situation, p will always 



^ Springer 



Page 8 of 41 



P. Reynaud-Bouret et al. 



correspond to the size of a subsample, i.e., a positive integer mudi smaller than n the 
total number of observations. 

2.2 Aggregated Test of Hq: "The Observed Processes Are i.i.d. Poisson Processes" 

To fix notation, we consider in the sequel that we observe n i.i.d. trials. Consequently, 
we have access to Ni, Nn, n i.i.d. point processes observed on [0, T^ax] repre- 
senting the n i.i.d. spike trains of a fixed recorded neuron during Tinax seconds. 

It is not possible to assess on just one realization whether a point process is a 
(non necessarily homogeneous) Poisson process or not since the variations of the 
repartition of the points between different parts of one trial can either be due to non- 
stationarity or to more complex dependency structures that cannot be studied on just 
one run. 

The first simple way to use the repetition of the trials is to use aggregation, which 
can be viewed as an empirical sum on the point processes. More precisely, the aggre- 
gated process over the processes Ni, . . . , Np is defined by 

p 

N^-P ^ y ^. or equivalently dN"-'' ^J^^^i- 

i = l....,p i = i 

By classical properties of Poisson processes [4], if the processes are i.i.d. Poisson 
processes with compensator A(-), then A^" is also a Poisson process with compen- 
sator pA{-). This implies that conditionally to the event {A^'''''([0, Tinax]) = «tot}, the 
observed points of N"-'' behave like an Mtot i-i-d- sample of c.d.f. 

A{t) 

t ^ F(t) = 



^(Tmax) 

Since F is unknown in our present situation, one has to estimate it, which leads to 
exactly the same plug-in problem as before. Fortunately, we are able to prove the 
following result. 

Proposition 2 Let Ni, . . . , Np be p i.i.d. Poisson processes with compensator A(-), 
assumed to be continuous, on [0, Tmax] ■ Let f/v°''([0,7'max]) associated empirical 

distribution, defined for any x by 

where N"'P is the aggregated Poisson process. Assume that F(-) is a consistent esti- 
mate of F(-) = ^(OMCTmax) such that 

^"•''([0,7;nax]) sup \F{x)-F{x)\ ^0. (4) 

Then 



^"•'^([0, r^ax]) sup |F^<=,P([O.W])(-^) - ^■ 



-<:e[0,rmax] 
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Once again, subsampling (i.e., ciioosing p much smaller than n) gives us estimates 
F satisfying (4). Two different approaches lead to two distinct tests. First, let us use 
the empirical c.d.f. on the whole sample. 

/^est2 

1. Take F as ^w."([0,rmax])' empirical c.d.f. of the whole aggregated process N"-" 
over the n trials (see (3)). 

2. Select randomly a subsample 5 of the trials with cardinality p — p(n), such that 

p(n)/n — > 0 (for instance take p(n) = Jn or pin) = r?-!^). 
n— >oo 

3. Aggregate the p processes in 5 to form N'''P and f^A''''P([0,rmax]) Proposition 2. 

4. Take ^i_q, the \ — a quantile of the asymptotic distribution /C. 

5. Reject Hq: "The n observed processes are i.i.d. (nonnecessarily homogeneous) Poisson 
processes" whenever 

JA'"'/'([0, Tmax]) sup \Ft^a.p(^Qj^^^)(x)-F{x)\>lx_a. 

\ .«-e[o.w] y 

In Additional File 1, we prove that this test is of exact asymptotic level a, as soon 
as the compensator yl(-) is continuous and this even if !(•) does not exist. However 
its practical performance are poor (see later). A slightly more useful test can be ob- 
tained by using smoother and more elaborate estimates F satisfying (4). We obtain 
the following testing procedure. 

1. Select randomly a subsample 5 of the trials with cardinality p — p(n), such that 
p(n)/n tends to 0. 

2. Use all the n observed processes to obtain a A.(-) such that, if the processes are Poisson 
processes with intensity X{-), one can assume that 



(n) / \k(u) 

Jo 



>0. (5) 

Jo ' 

3. Take 

F(t)- 



4. Aggregate the p processes in 5 to form N'''P and ^a?«.p ([0,7^3x1) ''^ Proposition 2. 

5. Take k[-a the I — a quantile of the asymptotic distribution IC. 

6. Reject Hq: 'The n observed processes are i.i.d. (nonnecessarily homogeneous) Poisson 
processes" whenever 

yA'«.P([0,W]) sup \Fn''.p([o,t^,,])(x)- F{x)\>~ki_c,- 

\ .«-e[0.rmax] y 

In Additional File 1, we prove that the previous test is of asymptotic level a. 
Note that Condition (5) can be demanding and rejection can be due to nonfulfillment 
of this condition. For instance, estimates X based on parametric estimates on a pre- 
scribed parametric model (such as maximum likelihood estimates for instance, see 
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Fig. 2 Aggregation versus cumulation. Description of the way the points are gathered together for aggre- 
gation or for cumulation. On the left-hand side, the first five lines correspond to a trial, the sixth line being 
the aggregated process. On the right-hand side, the same five lines are put together to form the cumulated 
process 



[8]) fulfill (5) if the prescribed model is true, but cannot fulfill this condition if the 
prescribed parametric model is not true. Hence, using parametric estimates in this 
setting lead to test both Hq and "the prescribed parametric model is correct," which 
is not satisfying. Therefore, it is natural to make no parametric assumption on the 
underlying model for k(-) and to try to fulfill (5) by using nonparametric estima- 
tion. 

Finally, as already observed by [8], aggregation can dilute the dependencies be- 
tween the points. Therefore, Tests 2 and 3 cannot be really powerful as we will see 
later. 

2.3 Cumulated Goodness-of-Fit Tests 

Another way to use the repetition of the trials, is to cumulate the p processes instead 
of aggregating them. This difference is made more explicit in Fig. 2. When processes 
are aggregated, points of very different trials can be very close, which can dilute 
dependencies between occurrences. This cannot occur for cumulated processes. With 
this method, we can also test models that are more general than Poisson processes. 
Those general models are usually described through their conditional intensity 
which represents the probability of occurrence of a point at time t given the past 
before t. So, defining a model through its conditional intensity is the easiest way to 
model the dependence between points. For instance, when we assume the conditional 
intensity X to be a deterministic function /, we are assuming independence with re- 
spect to the past. This is equivalent to assuming that the process is an inhomogeneous 
Poisson process with intensity A. = / . Therefore, testing Hq : "the observed processes 
are i.i.d. with conditional intensity k — f and unknown deterministic function /" is 
equivalent to testing Hq: "the observed processes are i.i.d. inhomogeneous Poisson 
processes." 



Springer 



Journal of Mathematical Neuroscience (2014) 4:3 



Page 11 of 41 



More generally, we wish to test a nonparametric hypothesis on the conditional 
intensity. An example, more developed in the next section, is the multivariate Hawkes 
process, which models the dependence between the spikes of different neurons via 
several interaction functions, for which we do not want to give a parametric form. Let 
us give just a simple expression of this process to illustrate our set-up, with only one 
process. The classical self-exciting Hawkes process has conditional intensity given 



where fi is a. positive real parameter and h a non negative integrable function with 
support in and with / = {pi,h). For instance, if the function h is supported by 
the interval (0, 2], then the probability of occurrence at time t randomly depends on 
the occurrences of the process on [r — 2, r). Testing whether the process is a classical 
self-exciting Hawkes or not can be rephrased as testing whether the process has con- 
ditional intensity given by the form X f defined in (6), with unknown /. Other famous 
examples in biomedical areas such as the multiplicative Aalen intensity or the Cox 
model can be found in [29]. 

As in the previous subsection, we use the time-rescaling theorem but in a deeper 
way. Remember that the general time-rescaling theorem [2] states that for any point 
process N on [0, Tmax] with compensator A{-), the point process J\f —{X — A{T) : 
r e A^} is a Poisson process with intensity 1 on [0, A{T^-^y^y\. Therefore, it is more 
interesting to cumulate the processes after time-rescaling than in the usual time space 
[0, Imax]- For general conditional intensity models, A{-) is random. Therefore the 
state space [0, ^(Tmax)] is also random in general. Moreover, when we are dealing 
with p i.i.d. processes N\, . . . , Np, each A^, has a different compensator A, (•) which 
depends on the history of the ith trial. So except in the Poisson case where A(-) is 
deterministic, we do not apply the same transformation to all the points. We finally 
have to deal with p processes A// = {X = 71, (T) : T e Ni] that are Poisson processes 
of intensity 1, and whose occurrences lie in [0, AiiTmax)]- Even if the Ai{T^ia.) are 
i.i.d., they are not equal in general. 

This leads to two main remarks. First, it is not possible to aggregate in gen- 
eral the time-transformed processes since we would aggregate processes with dif- 
ferent lengths (see Fig. 2). Therefore, Tests 2 and 3 cannot be transferred to the 
most general case straightforwardly. However, one can cumulate those processes 
as done in Fig. 2 and this even if the intervals have different lengths. The result- 
ing process Af^-P is therefore a Poisson process with intensity 1 on the random in- 
terval X — [0, ^f_i ^/(Tmax)] (see also Additional File 1 for a more precise for- 
mula and a proof of this statement). The second remark consists in noting that 
11f=i ^i (7max) being a random quantity, it is not true in general that conditionally 
to the total number of points in I, the points of Af^-P behave like an i.i.d. uniform 
sample, and in the sequel we shall need to restrict ourselves to an interval of the form 
[0, pd] with a deterministic bound p6, which is with high probability, smaller than 



by 




(6) 



ELl^/(7max). 
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Besides we have to deal with estimation of unknown transformations Ai(-). For 
this purpose, we introduce estimates of the type f Ai (f) = k; {u)du, where A, (■) 
estimates the conditional intensity of the /th process A^, . We obtain a cumulate 
process Af^'P built from the yl, (-)'s. We have the following equivalent to Proposi- 
tion 2. 



Theorem 1 Let Ni, . . . , Np be p i.i.d. processes with respective conditional intensity 
ki (•)■ Assume that there exist nonnegative estimates A,; (•) ofkj (•) such that 

P r Tmax I \ P 

A,(m) — Xi(u)\du I > 0. (7) 



'-"iU 



/)— >-oo 



Then, for all 9 > Q such thatE,{Ai{Ttnax)) > 0, 



Af'''P{[Q, pO]) sup 
«e[0,l] 



XI hx/(pe)<u} - u 



> iC. 



It is now easy to turn this result into an operational test, using subsampling. 
'Test 4 

1. Select randomly a subsample S of the trials with cardinality p = p(n), such that p(n)/n 
tends to 0. 

2. Use all the n observed processes to obtain a / such that iiX = Xf, one can assume that 



'\Y1 \(Xf)i{u)-{kf)i{u)\duj-—^0. 



P(nr^'^[y]\ \{}.f)i{u)-(kf)i(u)\du\-—^0. (8) 
3. For all ; in 5, take 



and change time of the /th process accordingly to obtain Mj = {X = Ai(T) : T e Ni] on 
[0,i,(r„ax)]. 

4. Cumulate the p processes A/} for i in S to formAA'^-''. 

5. Take the 1 — a quantile of the asymptotic distribution K.. 

6. Fix 9 > 0, strictly smaller than p^' X]ie5 (Tmax)- 

7. Reject Hq: "The n observed processes are i.i.d. processes with conditional intensity of the 
form A, f and unknown /" whenever 



A/''^'P([0,p6l]) sup 

a£[0,l] 



X! hx/(pO)<u} - U 



M':-p(lQ,pe]) 



In Additional File 1, we prove that the previous test is of exact asymptotic level 
a as soon as E(yl, (Tinax)) > ^- There exists a simpler form of this test when dealing 
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with Poisson processes since in this case compensators do not depend on / . 

1. Select randomly a subsample S of the trials with cardinality p = p(n), such that p{n)/n 
tends to 0. 

2. Use all the n observed processes to obtain a X(-), such that if the processes are Poisson 
processes with intensity X(-), one can assume that 



Jo '^^^ 



Vp(n){ I \X(u) - k(u)\du ] ^0. (9) 

3. Take 



A{t) = f {iiu)) du, 
Jo 

and change time of the !th process accordingly to obtain Mi = {X = A(T) : T e Ni] on 
[0, i(rn,ax)]. 

4. Cumulate the p processes in S to form AA'^'''. 

5. Take the 1 — a quantile of the asymptotic distribution K.. 

6. Fix 6 > 0, strictly smaller than A{Ttn-dx)- 

1. Reject Hq: "The n observed processes are i.i.d. non necessarily stationary Poisson pro- 
cesses" whenever 



N'='P([Q,pe]) sup 
uelO.ll 



N'^'P([o,pe]) 

X£AA'-P,X< 



This test, as a special case of Test 4, is also of exact asymptotic level a as soon as 
^(Tmax) > d. Tests 4 and 5 are more powerful to detect dependencies or to reject the 
Poisson assumption than Tests 2 or 3, as we will see later. 

As for Test 3, and for exactly the same reasons, we want to find nonparametric 
estimates satisfying (8) or (9). We provide in the next section powerful tools to deal 
with this problem and theoretical guarantees of performance of these estimates. 



3 Nonparametric and Adaptive Estimation 

3.1 Why Is Adaptive Estimation Useful? 

Nonparametric estimation, and in particular nonparametric estimation of Poisson pro- 
cess intensity, is at the root of most of the data analyses performed on spike trains. 
Indeed, peristimulus time histograms (PSTH) [36] are usually the first graphical rep- 
resentations of an experiment. Those histograms have usually a fixed length for each 
interval (typically 10 ms) and are quite noisy from a statistical point of view (see, 
for instance, the representations of [8]). Therefore, there have been several attempts 
to provide smoother estimates, either by kernel estimates (see, for instance, [30]) or 
by projection on an orthonormal basis (see, for instance, [7] for the use of splines). 
These methods provide a first illustration of the data with as few assumptions as pos- 
sible on the underlying "true" firing rate. They are originally not linked at all to any 
statistical or probabilistic models and constitute descriptive statistics. In particular. 
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no parametric assumption on the underlying intensity is made at this step, the para- 
metric model and its associated maximum likelihood estimator (MLE) being given 
in a second time once the shape of the curve is guessed [8]. Because of this lack of 
parametric assumption, those estimates seem to be the best candidates at first glance 
for the estimate X that needs to be plugged in Tests 3 or 5. 

However, the problem of the convergence rate remains. In all these methods, there 
is a tuning parameter that needs to be chosen: it is the length of the interval for his- 
tograms, the bandwidth in kernel rules or the number of coefficients in the orthonor- 
mal expansion. The problem of the choice of this parameter has first been tackled 
very roughly in the neuroscience literature by choosing a fixed value. On the real 
data presented here or on the ones in [8], it was usually considered that a bandwidth 
of 50 or 100 ms was a good choice. However, such a very rough choice cannot guar- 
antee a convergence rate when n goes to oo. Indeed let us look more closely at the 
kernel estimate. 

For the i.i.d. observed point processes Ni, . . . , Nn, the kernel estimate with kernel 
K and bandwidth h is given by 

X^t'(x)^-f2 j Kh(x-u)dNi(u)^- j Kh(x-u)dN"'"(u), (10) 
i=i 

where A/^" " is the corresponding aggregated process and where Kkiu) — {l/h) y. 
K(u/ h).lf we assume that the observed processes are inhomogeneous Poisson pro- 
cesses with intensity X, N"'" is also an inhomogeneous Poisson process with intensity 
nX and, therefore, 

E[Af''(x)] = {Kh-kX){x), Vx e M, 

where ★ denotes the convolution product. So, the expectation of X^'' constitutes a 
regularized approximation of To measure the performance of i,f '' , we compute 
its L2-risk (see further details in Additional File 2): 

Elllf'" -;^||, = \\KH-kX-X\\l+^^^^\\K\\l, (11) 
II n wi ^ nh 

which is classically interpreted as a bias-variance decomposition. Therefore, if h is 
fixed, the variance term goes to 0 whereas the bias remains fixed so that the L2-risk 
of the estimate does not go to 0. Consequently, a fixed choice for the bandwidth 
is not convenient and it is essential to choose h — h{n) tending to 0 with n. The 
dependence of h with respect to « is a problem that has been extensively studied in 
the density framework, a setting close to the present one since conditionally to the 
total number of points, the observed points of N"'" behave like an i.i.d. sample of 
density A(-)M(7max)- We refer the reader to [37] for a review. The main conclusion 
of such a study is that if X is regular and if the regularity is known then we are able to 
choose h{n) such that the L2-risk tends to 0 at a known rate of convergence depending 
on the regularity. Furthermore, the larger the regularity, the faster the rate. Typically, 
if the rth derivative of X is bounded in the L2 sense, then it is possible to choose K 
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and' h(n) x « ^/(^r+i) ^^^j^ jj^^j. L2-risk behaves as 

E||if'-A||^x«-2'-/(2'-+i). 

In this setting, this choice can be applied to Tests 3 and 5, since the Markov inequality 
implies that (5) or (9) are satisfied with p{n) — and 5 < (2r)/(2r + 1). The choice 
r = 1 gives 5 < 2/3 and r — 2 gives & < 4/5. 

Of course, in practice the choice of the bandwidth is capital. Since the smooth- 
ness of k is unknown, the practitioner cannot use the previous choice. Furthermore, 
guessing the order of magnitude of h (n) is not enough to achieve good performance 
since the leading constant plays an essential role. Hence, the theoretical considera- 
tions developed before do not solve the practical problem. Several directions have 
been proposed to overcome this problem. One of the most famous ones consists in 
using leave-one-out or other cross-validation methods [30, 38]: among a finite fam- 
ily of fixed bandwidths, such methods choose the best one in an asymptotic setting. 
However, to our knowledge, nothing can be said when the family of bandwidths is 
not fixed and some bandwidths tend to 0 with n . It is not clear at all that the resulting 
estimate achieves a prescribed rate and, therefore, it cannot be used for the proposed 
tests in particular. Other methods based on the rule of the thumb (and variations of it) 
have been proposed in the density or the Poisson setting [8, 39], and in this case the 
resulting bandwidth is of the form h(n) — Cn~^/^ for various possible choices of the 
constant C. Generally, those choices lead to poor results as noted by [8] (see also our 
the simulation study). 

Adaptive estimation [37] aims at tuning in a data-driven way the unknown pa- 
rameters of those methods (kernels, histograms, etc.) such that the resulting estimate 
has good practical performance and a guaranteed convergence rate. The adaptive es- 
timates are usually mathematically proved to achieve the best possible rate of conver- 
gence and this even if the regularity is unknown. Moreover, they do not depend on any 
restrictive assumption such as, for instance, some parametric assumption. The only 
assumption lies in the underlying probabilistic model (for instance, one assumes that 
the processes are inhomogeneous Poisson processes). Their reconstructions are there- 
fore much more trustworthy than other methods for which those extra assumptions 
may not be fulfilled. As a conclusion, adaptive estimates constitute ideal candidates 
to be plugged in Tests 3, 4, or 5. 

The main aim of next subsections is therefore to present adaptive estimates in the 
Poisson or in the Hawkes model that will have these good properties. 

3.2 Adaptive Estimation for Poisson Processes 

3.2.1 Kernel Estimates 

As mentioned previously, the Poisson setting is very close to the density setting. In 
the density setting, the main adaptive method for choosing a bandwidth is the Lepski 



If (an)n>Q and (ii)i)n>0 are two sequences, a(n) x b(n) means that there exists two positive constants 
c\ , C2 such that for all n>0, cia„ < b„ < C2a„ . 
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method, which has been recently updated to the multidimensional framework and 
to deal with the problem of choosing the leading constant of the bandwidth. Due to 
Goldenshluger and Lepski [40], it is referred in the sequel as the GL method. We 
propose here to adapt this method to the Poisson setting in the following way and to 
prove its adaptive properties. 

We consider a set of bandwidths V. and their corresponding kernel estimates '' . 
The bias-variance decomposition shows that the parameter h which minimizes the 
right-hand side of (1 1) with respect to /; e 'H is the best possible choice. It is called the 
oracle bandwidth: since it depends on X{-), it cannot be used in practice. To propose 
a data-driven choice of the bandwidth by a GL method, we define for any h,h' eTL: 



- J2 (Kh*Kh^)(x-T)^{Kh*in'')ix), Vxe: 



TeN"-" 



then for > 0, we set 



A(h) :— sup 

h'eH 



(1 + ??)(i + ||/:||i)||/f||2VA'"'"([0, U) 



The Additional File 2 shows that A{h) constitutes a good estimate of the bias term 
(see (11)). Finally, we select the data-driven bandwidth as follows: 



h :— argmin 

hen 



A{h) 



(1 + >?)(1 + ||g||l)||g||2VA^"-"([0, Tmax]) ' 



(12) 



which allows us to estimate 1( ) by using 



(13) 



Note that in (12), WKWIn"'" ([Q, T^-^y})/ (n^h) is an unbiased estimate of the variance 
term in (1 1) and therefore the previous criterion mimics the bias-variance decompo- 
sition of the risk of up to some multiplicative constant. Once K, Ti. and r] are 
chosen, we obtain a turnkey procedure. The following theoretical result justifies our 
procedure. 



Theorem 2 IfH C {D''^ :£)=!,..., Dmaxl with Dmax = &nfor some 5 > 0, and if 
\\k\\oo < oo, then. 



E||FL_^||2 ^ .^^ 



^l|/:||2)+C2«-\ 

nh J 



where C\ is a constant depending on \\K\\i and rj and Cj is a constant depending on 
S, rj, II A' II 2, II -K" 111, 11^ 111, and \\X\\oo. 

Theorem 2 combined with (11) shows that our procedure mimics the performance 
of the oracle up to the constant C\ and up to the term C2«~' , which is negligible when 
n goes to +00. It is classically called an oracle inequality, which is the main property 
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of adaptive estimates. In particular, one can take the family Ti — {!,..., L^«J /> 
which grows with n and it is possible to select a bandwidth tending to 0 with n. If the 
rth derivative of X(-) is bounded in L2, then the choice h(n) x n~i/(2''+i) i(j 
family H and the oracle inequality gives straightforwardly that 



which is the optimal rate of convergence over such spaces. This rate is achieved, even 
if we do not know in advance the regularity r of k, which is from a theoretical point 
of view the main improvement with respect to the theory described in the previous 
subsection. 

If K is the Gaussian kernel, then HA"!]! = 1 and ||A'||2 = 2~^^^7T~^^'^ . Moreover, 
Kh * Kf,' = K ^ i2_^_i a ^'1'^ ^ straightforward computation shows that explicit formula 

for II ij'''' — iif*' II 2 are also available. It is consequently very easy to implement the 
method, the computational cost being almost of the same order as cross-validation. 
We will see in the simulation study that this practical choice is also quite robust. 

3.2.2 Histograms 

In the Poisson set-up, there are several ways to select data-driven partitions that lead 
to adaptive histogram estimates. For instance, one can use model selection as in [41]. 
Model selection can either select a regular partition or an irregular partition on a 
grid. When regular partitions are considered, the resulting estimator satisfies an ora- 
cle inequality similar to the oracle inequality established in Theorem 2 for the kernel 
rule combined with the GL method. Indeed the bin for the histograms plays exactly 
the same role as the kernel bandwidth. Therefore, it leads to similar theoretical per- 
formance, except that the histograms cannot become smooth enough to guarantee an 
optimal convergence rate for regular intensities (namely r > 1). Therefore, the choice 
of regular partitions is probably not the best one and one may prefer the GL method. 
The data-driven choice of the partition becomes much more interesting when the par- 
tition is not forced to be regular. Indeed irregular partitions can capture a fast increase 
of the firing rate followed very quickly by a fast decrease at some particular moment 
of the experiment, without leading to too noisy estimates as the classical PSTH, since 
over smoother periods, the length of the interval can be much larger. However, the 
method of [41] is too time consuming to be really considered in practice. Another 
possible direction is the context of Markov modulated Poisson processes [42], where 
the algorithms are also quite time consuming without ensuring any adaptive prop- 
erty in terms of convergence rate (despite some possible interpretation with respect 
to hidden Markov processes). 

However, and as already noticed in [41], it is possible in certain cases to interpret 
a model selection estimate as a thresholding rule. We hereafter illustrate in a simpler 
case, the method developed in [43]: If !(•) e L2, we can decompose it on the Haar 
basis. 



E||i^L_;,||2^„-2./(2.+l)^ 
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where ,t(-) = </>(■ — k) with (p — l[o,i) the Haar father wavelet and where 
^jfj^ki.-) = 2'^^i/{2j{- - k)) for 7 > 0 with f = l[o,i/2) - l[i/2.i) the Haar mother 
wavelet. The tij,k^ are the unknown coefficients of A(-) and are given by 

v;.-.,..z. ft,. = /f„.wM,,... 

These coefficients can therefore be unbiasedly and consistently estimated by 

n J 

Given a fixed finite subset of indices m, we obtain an easily computable estimate of 

(j,k)em 

Since the Haar basis is piecewise constant, the previous estimate is also piecewise 
constant on a certain partition V depending on m. A data-driven choice of m there- 
fore leads to a data-driven choice of the partition that can be irregular. Let us fix an 
arbitrary highest level of resolution jo such that 2^" <n< 2"'"'"' and let us consider 
the L2-risk of Xj" such that if (j, k) e m then j < Jq. The bias-variance decomposi- 
tion of A"' can be written as follows: 

= E E + E J2\-Plkhi,k)Pn + Vj,kl(j,k)eml (14) 
J>jo k j<jo k 



where 



Vj,k 



:Var(^j,i)= ^ j fj^^{x)k{x)dx. 



Hence, the best subset m is the set of indices {j, k) such that fij^k > ■s/'^jM - This is the 
oracle choice. A possible data-driven way to choose the indices (j, k) is to choose the 
indices such that are larger than a certain threshold r]jk depending on an estimate 
of the variance v i^k - The choice advertised in practice in [43] is 



2Kln(«)0j,^ + ^— whereO/,^ = ^ / 1/^2 (^5^ 

3n ' n'- J ■'' 



Then we obtain the following thresholding estimator: 

./o 



EE^m1,|^,„>,,„V^M- (16) 



-1 k 



In [43], it has been proved that a slight modification of this estimate satisfies an or- 
acle inequality in the same spirit as Theorem 2. It also generalizes this estimate by 

^ Springer 



Journal of Mathematical Neuroscience (2014) 4:3 



Page 19 of 41 



considering general biorthogonal bases instead of the Haar basis, leading to smooth 
estimates (see [43, 44]). In this case, for a slight modification of the threshold, the re- 
sulting estimate has the same convergence rates as the kernel estimate combined with 
the GL method, up to some logarithmic term, as soon y > 1 . The choice y < 1 has 
been shown to lead to bad convergence rates and the choice y = 1 has been shown 
to work well on extensive simulations in both [43, 44]. This method is easily imple- 
mentable leading to very fast algorithms that are in particular faster than algorithms 
based on the GL method. 

3.2.3 More Sophisticated Procedures 

Thresholding rules and irregular partitions overcome a drawback of kernel estimates 
that suffer from a lack of spatial adaptivity on the time axis. Several attempts have 
been proposed to build more local choices of the bandwidth (see [30] for instance), 
but to our knowledge no mathematical proof of this spatial adaptation has been estab- 
lished, whereas histograms and in particular the previous Haar thresholding estimator 
can adapt the length of the bin to the heterogeneity of the data. But the resulting esti- 
mator is not smooth at all. As explained, we can consider a smoother wavelet basis, 
but this extension does not completely address the issue. 

The best alternative, to our knowledge, when the support of X(-) is known and 
bounded (here [0, Tlnax]) and when X(-) does not vanish for a significant period of 
time, is due to Willett and Nowak [45]. Their method is quite intricate to describe. In- 
formally, a penalized log-likelihood criterion is used to select a piecewise polynomial. 
Both the partition and the degree of each polynomial on each interval of the partition 
are free (on a very refined grid of resolution). Willett and Nowak have proved that 
such an estimator achieves optimal rates of convergence for various classes of regu- 
larity and in an adaptive way. From a practical point of view, a dyadic tree algorithm 
is used. Its complexity is much smaller than a full model selection method on the 
same piecewise polynomial family of models. It is a bit more complex than a thresh- 
olding algorithm, but there exist a program (FreeDegree) in Matlab interfaced 
with C which makes its use in practice quite easy. For a more complete description of 
the method, we refer to [45]. Note that in practice because of its adaptive properties, 
this estimator is able to be piecewise constant when the true intensity is piecewise 
constant but also very smooth (with high degree for the polynomials) when the un- 
derlying intensity is smooth and when the number of points is sufficient. It is also 
able to be spatially adaptive, the underlying data-driven partition being irregular. In 
the sequel, we denote this method 

3.3 Adaptive Estimation for Hawkes Processes 

If inhomogeneous Poisson processes can model nonstationary data, they are not 
appropriate to model dependencies between points. However, several studies have 
established potential dependence of spike occurrences for different neurons. This 
has been detected via descriptive statistics, via independence tests for a given fixed 
model or via model-free independence tests based on permutations (also called trials- 
shuffling) [5,6, 13, 15,22, 46]. 
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One simple model of dependency is the multivariate Hawkes process, which is the 
point process equivalent to the auto-regressive model. It has first been introduced by 
Hawkes [47], as a self-exciting point process, that is useful in particular in seismology 
(see, for instance, [23]). It has also been used to model positions of motifs along the 
DNA molecule [48, 49]. In neuroscience, it explicitly appears in the 1980s with [19] 
and is close in spirit to [50, 51], with the additional advantage of modeling potential 
feed-back between the neurons. 

The multivariate Hawkes process (see, for instance, [52] or [4]) models the instan- 
taneous firing rates of M different neurons, with spike trains A^*^'\ . . . , N^^\ where 
the conditional intensity of the mth point process is defined for any f > 0 by 

\ €=l-^-~ / + 

= ^'"^ + E E h^r^(t-Tt)) ■ (17) 

In (17), the y''"''s are positive parameters representing the spontaneous firing rates 
and the /z^'"''s are the interaction functions and have support included into R^. More 
precisely, before the first occurrence of the multivariate process, the Af('")'s behave 
like homogeneous Poisson processes with constant intensities The first occur- 
rence (and the next ones) affects all the processes by increasing or decreasing the 
conditional intensity via the interaction functions h^^^'s. For instance, if /i^'"^ takes 
large positive values in the neighborhood of the delay d and is null elsewhere, then 
after the delay d of one occurrence of N^^\ the probability to have a new occurrence 
of A^''"' will significantly increase: The process A?'^' excites the process Af^^'^.Onthe 
contrary, if is negative around d, then after the delay d of one occurrence of A^*^^* , 
the probability to have a new occurrence of A^('") will significantly decrease: The pro- 
cess A^^^-* inhibits the process A'^^™) . Note in particular that the functions /jj^"' 's model 
self-interactions. 

The Hawkes process as described above cannot really model nonstationary data. 
Indeed, when t grows (and under conditions on the interaction functions), the process 
converges quite quickly toward an equilibrium, which is stationary (see, for instance, 
[52, 53], and the references therein). If these conditions are not satisfied, the number 
of points in the process grows too fast to be a realistic model for spike trains anyway. 
Hence, Hawkes processes as defined in (17) cannot model nonstationary data, but can 
model dependent data. 

Therefore, we fix an interval [Ti, T2] C [0, Tmax], typically an interval where all 
the estimated mean firing rates seem constant. The aim is to estimate on this interval 

J — \\^ )m=\,...,M' V'i «,m=l,...,M^ 

where it is assumed that the interaction functions are bounded with support in [0, A] 
with Ti > A. 
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Inference for Hawkes models based on the likelihood can be found in the literature, 
in particular, for parametric models [23, 49]. However, in neuroscience, for flexibility, 
the used parametric models are based on a large number of parameters. Therefore, 
they require several thousand spikes per neuron to be observed in a stationary way 
to achieve good estimation [19]. Classical model selection based on AlC and BIC 
criteria has also been used to select the number of knots for the spline estimate [2 1 , 
48, 54]. However, these criteria do not adapt well to irregular functions. This is the 
reason why alternative nonparametric adaptive inference has recently been developed 
in such models. The univariate case (M =1) has been studied in [55], where rates of 
convergence depending on the underlying regularity of the self-interaction function 
have been derived. We can also mention the alternatives proposed in [20, 56] but no 
theoretical validation is provided in those works. 

A multivariate approach, valid for very general counting processes including 
Hawkes processes and based on £i penalties, has been recently developed in [57]. 
Based on minimization of convex criteria, its computational cost is more reasonable 
than procedures proposed in [55] and it is also proved to satisfy oracle inequalities. 
We shall detail this method in the case of Hawkes processes and with piecewise con- 
stant estimates of the underlying interaction functions. 

In the next section that can be skipped at first reading, we describe the method 
in a technical way. Then we give heuristic arguments to understand more deeply the 
presented method (see also [58] for a quicker view on this estimate). In particular, the 
method does not rely on the likelihood, but on a least-square contrast, which can be 
reinterpreted in terms of JPSTH [59]. 



3.3.1 Intensity Candidates and Least-Square Contrast on One Trial 



We first propose a conditional intensity candidate. So for any f with 



7^ = (KxL2([0, A])'' f 



1 ... M/m=i ... m) ■ sf^ s"PP°'^ in A] 



/=((M«,(grki,...,M) 
and ||/||2 = + E E sT^tfdt < oo), 

m m t 



we consider the predictable transformation if{f) — {ir^^\f), . . . ,-^^^\f)) such 
that 



Vr>0, ifl"'\f)^,i 



Am) 



M 

1=1''-°' 



Am) 



(t — u)dNi{u). 



(18) 



Note that X^"''^ = [^^"'\f*)]+. Therefore, for each m, i/^^'Hf) 

can be considered as 

a good intensity candidate as long as it is close enough to the conditional intensity 
)^(m) (gyen jf ^(m) (-y^ takes negative values). We measure the distance between i/^(/) 
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and k by using the classical L2-nonn || • || : 

m=l ^ 
M Y 

= T.f V/'^c/) - [i^t^'^nu'dt. (19) 

Depending on /*, the right-hand side is not observable. But minimizing the last ex- 
pression with respect to / is equivalent to minimizing / i — > y (/) with 

But by definition of the conditional intensity, y (/) is close to yif) with 

yif) = -2j2 [ \l"'\f)dN^"'\t)+Y.( \irl"'\f)fdt. (20) 

which is called the least-square contrast. This expression is observable and can be 
minimized if / is parameterized by a fixed number of parameters. 

One particular parameterization, that is used in practice, is obtained when each 
function g^^^ is a piecewise constant function written as 

K 

^(m) ^ 'Y^am,i,k8~^'^'^((k-\)SM}' (21) 

where 5 > 0 is the size of the bin and K the number of bins. So we have K& = A. The 
(im,i,k^ are the renormalized coefficients of g^^^ on the regular partition of size K. 
Since / — > if'-'^Hf) is linear, one obtains 

M K 

W > 0, iri"'\f) = mC") + J2T.^'nAkS-'^'N^'\[t -kS,t-{k- 1)8)), 

still for / = ((/^'■'"\ {g^i'^)t=\,...,M)m=\,...,M) ■ denote by a*'"^ the column vec- 
tor such that 

(a^™))' = {ix^"'\am,\,i, am,l,^, flm,2,l, . ■ ■ , am,M,K), (22) 
where ' denotes the transpose. Then one can write 

Vf>0, ^/'"^(/) = (Rc,)'a('"\ (23) 
with RCf being the renormalized instantaneous counts given by 
(Rc,)' = (l,5-i/2(cf ')',..., 5-1/2(0^)'), 
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and with cj^^ being the vector of instantaneous counts with delay of A^'^^^ i.e., 

(cf )' = {N^'^it -S,t)),..., N^'\[t -KS,t-{K- 1)5))). 

Hence, by (23), proposing irj"'\ f) as a candidate for the intensity A^™' of A^*^'") 
amounts to proposing a linear combination of instantaneous counts with delay to 
model the probability of the next occurrence of a point in N^"^\ 

Now, minimizing y(f) over such piecewise constant functions is equivalent, by 
linearity, to minimizing 

M 

y(f) = ^(-2(a("")'b('") + (a(»'>)'Ga^'"0 

m — l 

with respect to the vectors a^'"^ The vector b*^™' is observable and is given by 



= (A'«([7^l-7'2]),5-^/y„,l'---'^"^'X,,M) 



where 



and 



n,n,i = ( r N^^\[t -kS,t-(k- l)S))dN'-"'\t)] 

\JTi /k=l,...,K 



G= /" \c,(Rc,ydt. 



Note that the A:th component of n„, f is the number of couples {x,y) with x e 
j^(m) PI [T^j ,T2\,ye N^^^ and (y -x)e ({k - 1)3, k&] and G is a symmetric matrix of 
size \ + MK whose components are the integrated covariations of the renormalized 
instantaneous counts. The solution of this minimization problem is easily available: 
If G is invertible, 

Vm = l,...,M, a<"'' = G-'b<'"\ (24) 

Heuristic arguments show that (24) is a natural expression. We can indeed informally 
write for any m that 

assuming that at time t, the intensity is strictly positive. By linearity of i/'^*^'"', one can 
also write that 

dN^'^Ht) ~ (Rc,)'al"'^ + noise, 
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_i_ noise, 



where a^™' are the coefficients corresponding to /*, assuming that /* can be coded 
in this way. Finally, we obtain 

[ ' Rc,dN^'"\t) = b^'"' ~ / ' Rc,(Rc,)'af + noise ~ Gal"' 
Jti Jti 

showing that the estimate given in (24) should be a convenient preliminary estimate. 

3.3.2 Least-Square Estimates on Several Trials and Connections with JPSTH and 
Cross-Correlograms 

We observe now n trials and, therefore, we have access to (A^/'\ . . . , A^/*''')i=i....,;i an 
i.i.d. sample of a multivariate point process on [Ti , 72] . Each trial has its own history. 
So to each trial /, we can associate as in the previous subsection the matrix G, the 
vectors b^™^ and so on. Depending on the trial /, we denote them G^'', b^"' '' and so 
on. The least-square contrast for these n x M spike trains can then be written as 

K„ (/) = ^-2(a('"))' bC"-')^ + (a^'"))' G«^ a^"'^ (25) 

whose solution is given by 

Vm=l,...,M, a('")= ^^G*''^ ^^b^'"-')^ (26) 

The quantity (^"^j b^™-')) can be reinterpreted in terms of cross-correlograms and 
joint-PSTH, following [59]. Indeed we can write 

= ([A^Wr(m,r2]),5-V2-,^,i,...,3-i/2n^^^^^ 

where for any i 

.7-2 

yJTi ' " / k=l, 




n,n,i = -U,t-{k- \)&])dNl"'\t)^ ^ 



and [A^^'")]" " is the aggregated process over all the n trials for the mth neuron. The 
quantity n„, ^ can be reinterpreted as a particular histogram based on the joint peri- 
stimulus time scatter diagram as the JPSTH or the cross-correlogram (see Fig. 1 of 
[59] and Fig. 3 of the present article). More precisely as detailed in Fig. 3, the counts 
fim.i are close to a cross correlogram except that representations are not based on 
squares but on herringbones. Local features are then preserved, as for the JPSTH. 
Furthermore, the elements of the partition have the same area and can therefore be 
compared more easily. Besides, for small disjoint intervals [T\, T2] with an increas- 
ing parameter A (corresponding to the maximal size of the support of the interaction 
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Fig. 3 Scatter diagrams and histograms. In panel a, we recall how a scatter diagram is constructed for 
each trial (here black crosses for trial 1 and red crosses for trial 2) and then superposed. Various histograms 
can be built with those points, using different partitions. Panel b gives the partition corresponding to the 
classical cross-correlogram. Panel c gives the partition of the JPSTH. The diagonal squares are filled in 
gray : if a point of the scatter diagram lies in one of those square, its coordinates correspond to spikes 
that are very close to each other Two very close spikes (one on Nl, the other one on N2) and their 
corresponding point in the scatter diagram are added, showing that the reverse is wrong. Therefore, there 
are some couples that are close to each other and not counted by the JPSTH as such. Panel d gives the 
partition used for the computation of the vectors f of dimension K (K = 2 here). More precisely, nj 2 
corresponds to the vertical part 2 — >■ 1, whereas 112,1 corresponds to the horizontal part 1 — > 2. The same 
spikes ai'e added, and they are now counted as close to each other 



functions), and for S close to 0, we obtain representations close to the JPSTH, except 
that the limits are not parallel to the axis, but parallel to the diagonal. This change 
of orientation smooths the binning effects. Indeed the quantity that is binned for the 
quantities h,„,t is the delay itself between two points, whereas for the JPSTH, each 
position of the points is first binned. Therefore, two points whose distance is less than 
S are always counted as such in one of the diagonal parallelograms for h,„x, whereas 
they may eventually not be counted in a diagonal square of the JPSTH, because one 
point appears in one bin and the other one in another bin (see Figs. 3c and 3d). This 
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problem of information loss when binning is involved has already been discussed for 
the coincidence counts [15]. 

JPSTH and cross correlograms have been used for a long time in neuroscience, 
without links with any model. The formula (26), for the least-square estimate, shows 
the link between those descriptive statistics (more precisely the h„,j's) and the pa- 
rameters of the Hawkes model. To recover the parameters, we need, in particular, to 
inverse the matrix iYl'i=i G^'^). This matrix quantifies for instance the following situ- 
ation. Assume that M — 3 and that the interaction functions hj ^ and h^^^ are large on 
[0, S] and null elsewhere. We also assume that all the other interaction functions are 
null. In this situation, iii^ (or at least its first coordinate) will be large even if there 
is no direct interaction from on A^i. The matrix (^"-i G^'^) cumulates all these 
features (and also the fixed effect due to the spontaneous parameter, which needs to 
be subtracted) and inverting it enables us to find an estimate of the true interactions. 
See also [58] for a more visual transcription. 

Note, however, that even if many coefficients are null as in the above described 
situation, due to the random noise, the estimates a^"'' have non-zero coordinates al- 
most surely. Therefore, it is difficult to interpret the resulting estimate in terms of 
functional connectivity graph [58]. Moreover, if we wish to capture all the features, 
it is preferable to take A large and S small. Therefore, the number of parameters of 
the model, depending on K — AS~^ , increases. With a small number of trials n and 
a small interval [Ti, T2], the least-square estimate is doomed to be quite poor as the 
MLE[19]. 

To remedy these problems, we now consider €1 penalization to find a nonparamet- 
ric estimate with adaptive properties and prescribed convergence rate. 

3.3.3 Lasso Estimate 

The Lasso method as developed by [57], is based on the following penalized least- 
square criterion, reformulated here in the context of n i.i.d. trials: for any m = 
1,...,M, 

a^™^ e argmin^-2(a('"))'^^b<"'-''^ 

where la*^'"^! denotes the vector whose coefficients are the absolute values of the 
coefficients of a^'"' and where 

is a vector of positive observable weights given by 

yln(n(T2-m . 



dm,i,k^y/'2yln{n(T2- Ti))V,„j^k-\ ^ Bg^k, (28) 
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where 



Vm,l,k 



't-kS,t-(k- l)S))fdN^'"\t), 



3-^/2 sup Nl'^''{[t -kS,t -(k-l)S)), 



i,telTi,T2] 



and with 



d,n,Q = y/2yln{n{T2 - ri))[Af('")]'''" ([72, Ty]) + 



/ln(«(72-ri)) 
3 



Since the criterion (27) is convex, the minimization problem can be performed quite 
easily. The function / e "H associated with a is denoted in reference to the Bern- 
stein inequahty that governs the shape of the weights (see [57]). 

Because the penahy term added to the least-square criterion is a weighted 1 1 -norm, 
the resulting estimate is sparse and many coordinates in a^'"' will be null (see [60] for 
the seminal paper on Lasso methods). This estimate and much more general forms 
have been studied quite intensively in [57]. In Additional File 3, we prove an oracle 
inequality for a slight modification of the present estimate, whose exact form can also 
be found in [58]. 

Let us just present the result informally to highlight the main properties (the com- 
plete version can be found in Additional File 3). An oracle inequality, in the same 
spirit as Theorem 2, is proved. The main difference is that it holds on a event with 
large probability and not in expectation. We have an upper bound of 



that constitutes a compromise, as usual, between a bias term and a variance term. 
Minimizing the bias gives the best linear approximation of k of the form i/f (/) and 
this even if A is not of the form i/{f). In this sense, it applies in particular to Hawkes 
processes with self-inhibition (i.e., negative h^"^'s), which models refractory periods 
[22] and for which f X— (\lr(f))^ is not linear anymore. Finally, (29) leads to 
a control of the left-hand side of (8) adapted to the context of this section. Under 
further technical assumptions, we can then prove that Test 4 can be applied. We refer 
the reader to [57] for more details that are omitted here to avoid too tedious technical 
aspects. 

The last point already developed in [57] is that Lasso estimates are most of the 
time biased in practice. To overcome this problem, a two step procedure is proposed. 
It consists in finding the non zero coefficients of and performing a classical least- 
square estimate on this support. We denote this two-step estimate /^°. 




m 



(29) 
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4 Practical Performance 

4. 1 Description of the Data 
4.1.1 Real Data 

The data used here are a small subset of already partially published data in previ- 
ous experimental studies [15, 22, 61, 62]. These data were collected on a 5-year-old 
male rhesus monkey who was trained to perform a delayed multidirectional pointing 
task. The animal sat in a primate chair in front of a vertical panel on which seven 
touch-sensitive light-emitting diodes were mounted, one in the center and six placed 
equidistantly (60 degrees apart) on a circle around it. The monkey had to initiate a 
trial by touching and then holding with the left hand the central target. After a delay 
of 500 ms, the preparatory signal (PS) was presented by illuminating one of the six 
peripheral targets in green. After a delay of either 600 or 1200 ms, selected at random 
with various probability, it turned red, serving as the response signal and pointing tar- 
get. During the first part of the delay, the probability for the response signal to occur at 
500 + 600 ms = 1.1 s was 0.3. Once this moment passed without signal occurrence, 
the conditional probability for the signal to occur at 500 + 600 + 600 ms = 1.7 s 
changed to 1 . The monkey was rewarded by a drop of juice after each correct trial, 
i.e., a trial for which the monkey touches the correct target at the correct moment. 

Signals recorded from up to seven independently moving microelectrodes (quartz 
insulated platinum-tungsten electrodes, impedance: 2-5 MO at 1000 Hz) were am- 
plified and band-pass filtered from 300 Hz to 10 kHz. Single unit activity was ob- 
tained by performing an online discrimination of spikes on each electrode. Spikes 
were firstly selected by taking into account their amplitude using an online window 
discriminator with high-pass and low-pass filters. In cases where spikes were not 
discriminable due to their amplitude only, the electrode was moved until the signals 
were sufficiently distinct to be discriminable on this basis. Although off-line spike 
sorting was available, it was not used in this study. Indeed, beyond the reservations 
that one may have concerning the variable quality of the output of such software, the 
use of clean original electrophysiological signals makes safer the more specific study 
of precise neuronal synchronization. Neuronal data along with behavioral events (oc- 
currences of signals and performance of the animal) were stored on a PC for off-line 
analysis with a time resolution of 1 kHz. 

Two sets of data are here considered. They both correspond to a probability of 0.3 
that the response signal occur at 1 . 1 s for the monkey, but only correct trials where 
the response signal occurs at 1.7 s are considered. On both data sets, two neurons 
have been recorded simultaneously over [0, Imax] where Tmax is approximately two 
seconds. In the sequel. Data Set A (respectively B) corresponds to the pair of neurons 
(Nl A, N2A) (respectively to the pair of neurons (NIB, N2B)). In Data Set A (respec- 
tively B), 177 trials (respectively 141 trials) are considered. Figure 4 plots the rasters 
associated with both data sets. However, because 6 different directions of movement 
were proposed to the monkey, we can also consider in both data sets, 6 subsets of 
trials, each subset corresponding to a prescribed direction of movement (see Table 1 
for a repartition of the number of trials per direction). 
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Fig. 4 Raster plots of the data sets. In panel a the rasters associated to Data Set A i.e. (NIA, N2A). In 
panel b, the ones associated to Data Set B, i.e. (NIB, N2B). Each line corresponds to a trial, each dot to a 
spike 



Table 1 Repartition of number 
of trials on the real data sets 





Direction 










Total 




1 


2 


3 


4 


5 


6 




Data Set A 


28 


31 


30 


35 


28 


25 


177 


Data Set B 


23 


24 


26 


18 


30 


20 


141 



Therefore, n, the total number of trials will be close to 200 if one aggregates over 
all the directions or will belong to the interval [20, 35] if one considers the trials 
according to the directions. Those trials are assumed to be i.i.d. This assumption is 
more reasonable if one considers trials for a fixed given direction. 

4.7.2 Simulated Data 

To assess the performance of our procedure, simulated data for which the underly- 
ing model is known have also been simulated. Three different data sets have been 
simulated, with the thinning method [63]: 

• (S-HomPoi) Spikes are distributed according to an homogeneous Poisson pro- 
cesses of intensity 20 Hz on [0, 2] s. 

• (S-InPoi) Spikes are distributed according to an inhomogeneous Poisson processes 
with piecewise continuous intensity on [0, 2] s given by 

t ^ x(t) = j^lsi + /.,e-4*(-^')V(^^-(-^'>^)]i,,[,._,,,+,), 

(=1 

with g = [5,30,0], h = [12.5,15,12.5], c = [0.375,1.25,1.825], and r = 
[0.375,0.5,0.125]. 

• (S-Haw) Two spike trains are simulated according to a bivariate Hawkes process 
observed on [0,2]. Each process is respectively denoted A'^*^^^ and A'^^^^. Their 
intensities are given by (17) with spontaneous parameters y^'^ = v*^^' = 20 Hz 
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Table 2 /^-values of the chi-square test of the Poissonian distribution for the number of spikes per trial. 
The following code is used: o corresponds to a p-value of the test by upper values in [10""^, 10~'), A 
to a p-value of the test by upper values in [10"-', 10~^), AA to a p-value of the test by upper value in 
[10~^, 10~^), A A A to a p-value of the test by upper value in (— oo, 10~^). The signs are filled in black if 
the p-values correspond to rejection of a Benjamini and Hochberg (BH) multiple test method [64] either 
on the simulated data (left part of the table) or on Data Sets A and B (right part of the table) 



n Directions Pooled 





40 200 


1 


2 3 4 


5 


6 


S-HomPoi 


NIA 


o 


o • 


A 


AAA 


S-lnPoi 


N2A 




AA 


o 


o AAA 


S-Haw(W(l)) 


NIB 


o 


o o 




AAA 


S-Haw(W(2)) 


o N2B 


o 


• 




AAA 


and interaction functions h^^-' — h^2^ 




-20 X l[o,o.oo5], hj ^ 


= 60 X 


l[0,o.oi] and 



/zf) = 0. 

Each time a n i.i.d. sample is drawn. 

The several treatments have been done in R except Willett and Nowak's estimate 
(WN) for which Matlab has been used. 

4.2 Results 

4.2.1 Checking the Homogeneous Poisson Assumption 

One of the simplest tests is to check whether the number of spikes per trial obeys a 
Poisson distribution with unknown parameter. Since the Poisson distribution is dis- 
crete, one can use a chi-square test with one estimated parameter, whose results are 
summarized in Table 2. Since the number of spikes per trial is also a Poisson vari- 
able for inhomogeneous Poisson processes, it is reasonable to have large p-values 
for (S-HomPoi) and (S-InPoi). Relatively smaller p-values appear for (S-Haw), but 
they are not small enough for a clear rejection: This test seems therefore not very 
powerful. Note that a very close test has been used in [8] on disjoint intervals of 
observations. The estimate of the parameters were computed via a prescribed para- 
metric model and, therefore, the procedure was testing both the Poisson assumption 
and the parametric assumption. To our knowledge, chi-square tests cannot be adapted 
to a nonparametric plug-in. On Data Sets A and B, the most undoubtedly rejection 
appears for the pooled data, which can be explained by the fact that those data are not 
i.i.d. 

The second simplest test (as prescribed by Yana et al. [1] for instance) is the clas- 
sical Kolmogorov-Smirnov test of uniformity performed on all the spikes, once all 
the trials have been aggregated, which relies on the fact that conditionally to the total 
number of observed spikes the points of a homogeneous Poisson process should obey 
a uniform distribution. Table 3 shows the corresponding p-values. It is quite coherent 
to have large p-values for (S-HomPoi) and very small p-values for (S-InPoi) because 
of the lack of stationarity of the later. Since (S-Haw) is stationary, aggregation dilutes 
the dependence and it explains that this test is not powerful and that the p-values are 
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Table 3 p-values of the classical K.S. test of uniformity on the spikes, all trials being aggregated. Same 
codes as in Table 2. Note that none of the p-values were close enough to 1 to force a rejection by the test 
by lower values (i.e., rejection when the test statistic is smaller than k„ oi, which corresponds to p-values 
of the test by upper values larger than \ — a) 



n Directions Pooled 





40 


200 




1 


2 


3 


4 


5 


6 




S-HomPoi 






NIA 


AAA 


AAA 


AAA 


• 


AAA 


AA 


AAA 


S-InPoi 


AAA 


AAA 


N2A 


AAA 


AAA 


AAA 


AAA 


AAA 


AAA 


AAA 


S-Haw(W(l)) 






NIB 


AAA 


AAA 


AAA 


AAA 


AAA 


AAA 


AAA 


S-Haw(W(2)) 






N2B 


AAA 


AAA 


AAA 


AAA 


AAA 


A 


AAA 



Table 4 p-values of Test 1 on the ISI, with a subsample size [["tot ]"^^^J ■ where nj^' is the total number 
of ISI that have been observed in all the trials. Same codes as in Table 2 for the p-values. Note that none 
of the p-values were close enough to 1 to force a rejection by the test by lower values (i.e., rejection when 
the test statistic is smaller than k^^a , which corresponds to p-values of the test by upper values lai'ger than 
1-a) 



n Directions Pooled 





40 


200 




1 


2 


3 


4 


5 


6 




S-HomPoi 






NIA 


AAA 


AAA 


A 


• 




o 


AAA 


S-InPoi 


AA 


AAA 


N2A 


AAA 


AAA 


AAA 


A 


AAA 


AAA 


AAA 


S-Haw(A'(l)) 


o 


A 


NIB 


AAA 


AAA 


AAA 


• 


AAA 


AAA 


AAA 


S-Haw(A'(2)) 


A 


AAA 


N2B 


• 


AAA 


A 


o 


AA 


A 


AAA 



large in this case. This test clearly rejects for Data Sets A and B the homogeneous 
Poisson hypothesis. 

Another test of first-order statistics as explained in [1] is the exponential test on 
the ISI (see Table 4). Here, we apply our version of the exponential test, i.e.. Test 1, 
using the subsampling scheme. As soon as there is enough trials, this test is power- 
ful enough to detect the nonstationarity (S-InPoi), but also the dependence (S-Haw), 
since there is no aggregation to dilute the dependence between consecutive points in 
one trial with respect to the previous test. On Data Sets A and B, the result is almost 
the same as the previous test. 

4.2.2 Checking the Inhomogeneous Poisson Assumption 

Let us first look at the reconstructions of the underlying intensities. We first program 
two very basic kernel estimates (10): the sliding window (i.e., K — (l/2)l[_i_i]) 
with length 0.1 s (i.e., bandwidth 0.05), denoted A^** , and the Gaussian kernel (used 
for instance in [8]) with the same bandwidth and denoted . As [8], we also pro- 
grammed a data-driven choice of bandwidth, called the thumb rule (here we followed 
[39] to construct it): it is denoted xf''* ■ Finally, we programmed the three presented 
adaptive method. The GL estimate, is programmed with the Gaussian kernel. 
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Fig. 5 Reconstructions of the intensity k on simulated Poisson data. Reconstructions for (S-HomPoi) 
and (S-lnPoi) over n trials, observed on [0, 2]. Th corresponds to the adaptive histogram xj'', GL to the 
Goldenshluger and Lepski's method aJ^^ with the Gaussian kernel, Kh to a fixed bandwidth h = 0.05 for 
the Gaussian kernel, SWh to the sliding window with h = 0.05, WN to the Willett and Nowak's method and 
Kh* to the rule of the thumb. Panel a corresponds to (S-HomPoi) with n = 40, panel b to (S-InPoi) with 
n = 40 and panel c to (S-lnPoi) with n = 200 

with the choice rj = 0.5 and with the bandwidths family 

n^{D-^ : D = 4, 5,6,7, 8,9, 10, 11, 12, 14, 16, 18,20, 22, 25,30,35,40, 45, 50}, 

which has shown a robust behavior on various simulations, for the present considered 
size of n (n ^ 40). The thresholding estimate on the Haar basis, X],^, has been per- 
formed with 70 = 15 and y = 1. We use the Matlab package FreeDegree for the 
Willett and Nowak estimate, A^^. Reconstructions for (S-HomPoi) and (S-InPoi) are 
given in Fig. 5. 

First of all, A^^* is clearly the worst choice, as expected for such a rough ker- 
nel. Figure 5 a shows the reconstruction of a constant intensity. Kernel estimates with 
Gaussian kernels are oscillating; the thumb rule bandwidth h* is larger than h, the 
GL bandwidth. The fixed bandwidth h = 0.05 is the smallest one and is quite inad- 
equate in this setting. The adaptive Haar thresholding rule Xj^ is much better in this 
case. The WN method is able to reconstruct perfectly the flat line. For (S-InPoi), the 
intensity has large jumps and smooth bumps (Figs. 5b and 5c). For such an irregular 
intensity and for a small number of trials (Fig. 5b), the thresholding estimate and the 
WN method are both able to recover the jumps perfectly but the smooth bumps are 
estimated by a piecewise constant function. The Gaussian kernels are better for the 
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Fig. 6 Variability of the reconstructions of the intensity A on simulated Poisson data. Superposition of 20 
reconstructions by Willett and Nowak's method for (S-InPoi) over n trials, observed on [0, 2]. In panel a, 
n = 40. In panel b, n = 200 



estimation of the bumps, but of course, they cannot detect the jumps. In this respect, 
the GL bandwidth is the best, whereas A„ * and k„ '' are too smooth. For a large 
number of trials (Fig. 5c), the thresholding estimate is a bit refined but clearly suffers 
from a lack of smoothness. Unlike A„ ** and A,f * , is reconstructing all the three 
bumps. The WN method is reconstructing more accurately the jumps despite some 
important boundary artefacts. It also gives smoother reconstructions for the bumps. 
In conclusion, the GL method clearly gives a bandwidth choice that adapts to high 
irregularity of the intensity with respect to other choices, whereas the thresholding 
estimate, which leads to an adaptive histogram, is more spatially adaptive despite its 
lack of smoothness. Up to boundary effects, the WN methodology seems to be the 
most accurate, since it adapts to the regularity of the underlying intensity. Note, how- 
ever, that on an interval with a few number of points, this method provides a piece- 
wise constant reconstruction, even if the underlying intensity is smooth, because this 
choice is more robust. This conclusion is also coherent with two previous and more 
extensive studies (see [43, 44]). 

To understand more clearly the variability of WN estimates, we have drawn 20 
reconstructions in Fig. 6, which confirms that the variability is small and diminishes 
when n grows. The estimates are more likely to be piecewise constant for small n. 
For large «, there are some edge effects around the jumps, but the amplitude around 
the bumps is very small thanks to a smooth estimate in those parts of the curve. 

As an example of estimation on real data. Fig. 7 displays an interesting case, 
namely N2A in direction 1 , where one sees clearly that WN estimate is able to find 
correctly the main jump in the intensity (which is clearly seen in the raster plots of 
Fig. 4), but also to produce a smoother estimate at some places (here a straight de- 
creasing line on the right) to reproduce the much slower attenuation of the firing rate 
that can be seen in the raster plots. 

Now let us apply the different proposed tests. First, Test 2 has been applied on 
the simulated data sets with n — 40 and p — [n^/^J and nothing was declared signifi- 
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Fig. 7 Example of estimation of the intensity A on real data. Estimations for neuron N2A, in direction 1 . 
Same conventions as in Fig. 5 

cant. However, the p-values corresponding to (S-Haw) are abnormally large. To take 
this into account, we have performed also a variant of Test 2 where one rejects if the 
same test statistic is now smaller than fc„ This test consists therefore in rejecting 
the Poisson hypothesis when both estimated distributions are too close. As Test 2, 
this test is also asymptotically of level a, by application of Proposition 2. Actually, 
all the tests presented here can be said to reject "by upper values" and have therefore 
a version "by lower values." On each set (simulated or not) of Test 2 ;?-values (by 
upper and lower values), one can perform a Benjamini and Hochberg procedure with 
FDR 5 % [64]. It declares both processes in (S-Haw) as non-Poissonian in the fam- 
ily of simulated data (p-values in [10~^, 10~^)). Due to the high variability of the 
reconstructions on Data Sets A and B, which depend on the considered direction, it 
was not possible to pool the data together and, therefore, the corresponding tests have 
been performed direction by direction. However, Tests 2 by upper or lower values do 
not detect anything on Data Sets A and B. 

Tests 3 (by upper or lower values) do not detect anything on the simulated data 
sets. However, on Data Sets A and B (see Table 5), Test 3 clearly rejects the Poisson 
hypothesis for most of the directions of NIB and in this sense. Test 3 is more powerful 
than Test 2. 

Test 5 is, as expected, more powerful than Tests 2 and 3 and detects the non- 
Poissonian structure in (S-Haw) (see Table 6). More importantly, on Data Sets A and 
B, all the p-values of Test 5 (by upper values) are smaller than lO"^"*^ making clear 
that those data are not inhomogeneous Poisson processes. 

4.2.3 Checking the Hawkes Assumption 

Once again. Data Sets A and B have to be treated direction by direction because of 
the high variability of the reconstruction. However, because of the very small number 
of trials per direction, it was not possible to look at very small subintervals {T\ , 72]. 
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Table 5 p-values of Test 3 on 

Data Sets A and B, with a Directions 
subsample size [n'^^^], where ii 
is the number of trials. Same 
codes as Table 2 for the p-values 
of the test by upper values. For 
the p-values of the test by lower 
values, same codes except that o 
becomes □ and A becomes V 
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Table 6 p-values of Test 5 on 

the simulated data sets, with 

H = 40 and a subsample size 

[fi^/^ J . Same codes as in S-HomPoi 

Tables 2 and 5 for the p-values S-InPoi 

S-Haw (A?('>) 
S-Uaw{N'^'l) 



Th GL WN 



Therefore, we decided to look at the largest interval that one can take i.e. [K8, 2]. We 
choose K — S — 0.005 and y — I. On simulated data (Fig. 8), and recover 
the support of the interaction functions and also find that h\ —0, which could not 
have been possible with a classical least-square estimate (see also more comments on 
the functional connectivity graph in [58]). Moreover, Z^*-* is less biased than as 
expected. We provide one estimation of the interaction functions for Data Set A in 
direction 2 (Fig. 9), where it clearly appears a one way excitation of N2A on NIA. 
This is coherent with a previous study on the same data set, which finds this pair 
of neurons dependent through a complete different method [22]. Note also that at 
short range the self-interaction functions are negative, fact which is consistent with 
refractory periods. 

Test 4 is used to check the Hawkes assumption, with / given either by or 
(see Table 7). It is coherent to find that, on simulated data, the Hawkes assumption is 
accepted for (S-Haw), but also for (S-HomPoi), which is a particular case of Hawkes 
process with null interaction functions. On the contrary, Test 4 detects that (S-InPoi) 
is not a Hawkes process. On Data Sets A and B, the Hawkes assumption is sometimes 
accepted on the whole interval of observations. Let us focus for instance on NIA in 
direction 2 (see also Fig. 9). This process could not be a Poisson process, since the 
p-value of Test 5 is smaller than 10~'^. However, the large p-value for Test 4 com- 
bined with Fig. 9 means that this process can be explained via a pure self-interaction 
process, which is negative at short range (refractory period) and positive at larger 
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Fig. 8 Reconstruction of the interaction functions on simulated data. Reconstructions for (S-Haw) with 
n = 40 trials, with Ti = 0.05 and T2 = 2. On the upper left, the interaction function /I'j'', on the upper 
right, h2 \ on the bottom left h^^^ and on the bottom right h^\ In black, the true interaction functions, in 
green, reconstruction by method f^, in red, reconstruction by method f^'^ . On the top of each graphics. 



the true spontaneous parameter ' on the top and the true parameter ' on the bottom are referred by 
S*. The estimated spontaneous parameters by method f^ are referred as SB and the estimated spontaneous 



parameters by method f^'^ are referred as SBO. We used K -- 



-- 0.005, and y = 1 



range. Note that 9 processes on the 24 are not explained in this way (in particular 
N2A in direction 2), and are not Poisson either: Ihis can be due to a too large lack of 
stationarity, combined with a large dependence between points. 



5 Conclusion 

When using the time-rescaling theorem to assess whether an observed spike train 
obeys a certain probabilistic model (e.g., Poisson, Hawkes, etc.), a plug-in step is 
currently performed [1-3, 8, 24]. If this plug-in step is done without care the result- 
ing test may be much too conservative leading to poor detections (see Fig. 1). We 
propose here to use the subsampling as an almost universal solution when dealing 
with Kolmogorov-Smirnov tests of uniformity, such a universal solution being com- 
pletely new. The main requirement is to have access to an estimate of the underlying 
intensity, whose rate of convergence is known (see, for instance, (5)). 

In classical previous works such as [8, 23], parametric estimates such as MLE over 
a prescribed parametric model are used as plug-in estimates, which are converging 
toward the true intensity when the underlying parametric model is true. Therefore, 
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Fig. 9 Example of estimation of the interaction functions on real data. Estimations for Data Set A in 
direction 2 with n = 40 trials, with T"; = 0.05 and T2 = 2. Same conventions as in Fig. 8 



Table 7 p-values of Test 4 with a subsample size L'l'^^'^J ■ Same codes as in Tables 2 and 5 for the p-values 
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when this parametric plug-in is used inside omnibus tests (and in particular the ones 
presented here), one has to be careful that we are not only testing the probabilis- 
tic assumption, but also the fact that the intensity belongs to the parametric model. 
To overcome this problem, we advertise for the use of nonparametric estimates and 
more precisely to adaptive estimates, for which rates of convergence are known under 
lighter assumptions on the intensity than a prescribed parametric assumption. More- 
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over, those adaptive methods have very good practical performance making them also 
very good in practice for estimation. 

On some simulated data and on some real data sets, we have shown that our 
method performs very well. However, there are still two main directions in which 
our work need to be pursued in order to provide a more complete answer on real data 
sets. First of all, the KS test of exponentiality on the ISI can also be performed instead 
of the KS test of uniformity [1]. If the method with subsampling can be adapted to 
this case, we have presently no guarantee that this test would have a controlled level. 
Indeed we have no equivalent of Proposition 2 or Theorem 1 for the ISI repartition. 
The second main drawback is that we are clearly able to reject (at least on the pre- 
sented real data sets) both homogeneous and inhomogeneous Poisson assumptions. 
We are also able to test whether the processes are Hawkes or not, which in particular 
takes into account refractory periods and dependence between several spike trains. 
However, the Hawkes model reflects stationary features, and cannot model nonsta- 
tionary data. Therefore, we would need a more general model, which includes the 
dependence as in the Hawkes model presented here, but also the non- stationarity 
as in the inhomogeneous Poisson process. At the present moment, models reflecting 
both are not compatible with a full agnostic approach where no assumption is made 
on the underlying functions, the estimation problem being not completely identifiable 
[16]. A first step will be therefore to provide a trade off between estimation capacity 
and not too restrictive assumptions on the process itself, with respect to real spike 
train data. 
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